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1 Overview 

This document considers the problem of drawing samples from posterior distributions formed 
under a Dirichlet prior and a truncated multinomial likelihood, by which we mean a Multi- 
nomial likelihood function where we condition on one or more counts being zero a priori. 
An example is the distribution with density 
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where n G A := {x G M.% : J2i x i = 1 }> a e R +> and M + = i x e R : x > °}- We sa y the 
likelihood function has two truncated terms because each term corresponds to a multinomial 
likelihood defined on the full parameter tt but conditioned on the event that observations 
with a certain label are removed from the data. 

Sampling this posterior distribution is of interest in inference algorithms for hierarchical 
Bayesian models based on the Dirichlet distribution or the Dirichlet Process, particularly 
the sampling algorithm for the Hierarchical Dirichlet Process Hidden Semi-Markov Model 
(HDP-HSMM) [ ] which must draw samples from such a distribution. 

We provide an auxiliary variable (or data augmentation) [(:>] sampling algorithm that is 
easy to implement, fast both to mix and to execute, and easily scalable to high dimensions. 
This document will explicitly work with the finite Dirichlet distribution, but the sampler 
immediately generalizes to the Dirichlet Process case based on the Dirichlet Process's def- 
inition in terms of the finite Dirichlet distribution and the Komolgorov extension theorem 
[5]. 

Section 2 explains the problem in greater detail. Section 3 provides a derivation of our 
sampling algorithm. Finally, Section 4 provides numerical experiments in which we demon- 
strate the algorithm's significant advantages over a generic Metropolis-Hastings sampling 
algorithm. 

Sampler code and functions to generate each plot in this document are available at 
https : //github. com/mattj j/dirichlet-truncated-multinomial. 



2 Problem Description 



We say a vector ir g A is Dirichlet-distributed with parameter vector a € K™ if it has a 
density 
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with respect to Lebesgue measure. The Dirichlct distribution and its generalization to 
arbitrary probability spaces, the Dirichlet Process, are common in Bayesian statistics and 
machine learning models. It is most often used as a prior over finite probability mass 
functions, such as the faces of a die, and paired with the multinomial likelihood, to which 
it is conjugate, viz. 
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That is, given a count vector m € N™, the posterior distribution is also Dirichlet with an 
updated parameter vector and, therefore, it is easy to draw samples from the posterior. 

However, we consider a modified likelihood function which does not maintain the con- 
venient conjugacy property: the truncated multinomial likelihood, which corresponds to 
deleting a particular set of counts from the count vector m or, equivalently, conditioning on 
the event that they are not observed. The truncated multinomial likelihood where the first 
component is truncated can be written 
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= Mult(m|7r,{m 1 = 0}). 



In general, any subset of indices may be truncated; if a set X C {1, 2, . . . , n} is truncated, 
then we write 



TruncMultx(m|7r) := 
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where m. = Ej m i- This distribution can arise in hierarchical Bayesian models such as the 
HDP-HSMM [4]. 

In the case where the posterior is proportional to a Dirichlet prior and a single truncated 
multinomial likelihood term, the posterior is still simple to write down and sample. In this 
case, we may split the Dirichlet prior over I and its complement X :— {1, 2, . . . , n} \ X; the 
factor over X is conjugate to the likelihood, and so the posterior can be written 
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(a) The prior, Dir(7r|a). 
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(b) The posterior proportional to (c) The posterior proportional to 
Dir(7r|a) • Mult(m|7r). Dir(7r|a) ■ TruncMult {1} (m\n). 

Figure 1: Projected visualizations of the prior distribution Dir(7r|a) for n = 3 and 
a = (2, 2, 2) and the associated posterior distributions when paired with Mult(m|7r) and 
TruncMult{ij (m|7r) where m = (0, 2, 0). In low dimensions, the posteriors can be computed 
via direct numerical integration over a discretized mesh. 



from which we can easily sample. However, given two or more truncated likelihood terms 
with different truncation patterns, no simple conjugacy property holds, and so it is no 
longer straightforward to construct samples from the posterior. For a visual comparison in 
the n = 3 case, see Figure I. 

For the remainder of this document, we deal with the case where there are two likeli- 
hood terms, each with one component truncated. The generalization of the equations and 
algorithms to the case where any set of components is truncated is immediate. 

3 An Auxiliary Variable Sampler 

Data augmentation methods are auxiliary variable methods that often provide excellent 
sampling algorithms because they are easy to implement and the component steps are 
simply conjugate Gibbs sampling steps, resulting in fast mixing. For an overview, see the 
survey [6]. 

We can derive an auxiliary variable sampler for our problem by augmenting the distri- 
bution with geometric random variables k = (ki,k 2 ) — {{kij},{k 2 j})- That is, we define 
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for kij = {0, 1, 2, . . .} a new distribution q such that 
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where {run} and {m2i} are sample counts corresponding to each likelihood, respectively, 
and rrii. := • TOy. Note that if we sum over all the auxiliary variables k, we have 
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and so if we can construct samples of tt, k\m, a from the distribution q then we can form 
samples of n\m,a according to p by simply ignoring the values sampled for k. 

We construct samples of tt, k\m, a by iterating Gibbs steps between k\%, m, a and w\k, m, a. 
We see from (11) that each kij in fe| 7r, m, a = fc|7r, m is independent and distributed accord- 
ing to 

q(kij\w,m) = (1 - TTi)^ 3 ■ (15) 

Therefore, each kij follows a geometric distribution with success parameter (1 — m). 
The distribution of 7r|fc, m, a in g is also simple: 

oc (nc->) (n (r^) mi1 ) (n (r^f) ™ 
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where to is a set of augmented counts including the values of k. In other words, the Dirichlet 
prior is conjugate to the augmented model. Therefore we can cycle through Gibbs steps in 
the augmented distribution and hence easily produce samples from the desired posterior. 
For a graphical model of the augmentation, see Figure 2. 



4 Numerical Experiments 

In this section we perform several numerical experiments to demonstrate the advantages 
provided by the auxiliary variable sampler. We compare to a generic Metropolis-Hastings 
sampling algorithm. For all experiments, when a statistic is computed in terms of a sampler 
chain's samples up to sample index t, we discard the first |_§J samples and use the remaining 
samples to compute the statistic. 
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(a) Un-augmented distribution. (b) Augmented distribution. 

Figure 2: Graphical models for the un-augmented and augmented probability models. 

Metropolis-Hastings Sampler We construct an MH sampling algorithm by using the 
proposal distribution which proposes a new position tt' given the current position tt via the 
proposal distribution 

p(tt'\tt;P) = Dir( 7 r'| ( 8 • tt) (19) 

where j3 > is a tuning parameter. This proposal distribution has several valuable proper- 
ties: 

1. the mean and mode of the proposals are both tt; 

2. the parameter j3 directly controls the concentration of the proposals, so that larger /3 
correspond to more local proposal samples; 

3. the proposals are naturally confined to the support of the target distribution, while 
alternatives such as local Gaussian proposals would not satisfy the MH requirement 
that the normalization constant of the proposal kernel be constant for all starting 
points. 

In our comparison experiments, we tune j3 so that the acceptance probability is approx- 
imately 0.24. 

Sample Chain Autocorrelation In Figure 3 we compare the sample autocorrelation 
of the auxiliary variable sampler and the alternative MH sampler for several lags with 
n = 10. The reduced autocorrelation that is typical in the auxiliary variable sampler chain 
is indicative of faster mixing. 

The R Multivariate Potential Scale Reduction Factor The R statistic, also called 
the Multivariate Potential Scale Reduction Factor (MPSRF), was introduced in [1] and is a 
natural generalization of the scalar Scale Reduction Factor, introduced in [2] and discussed 
in [3, p. 296]. As a function of multiple independent sampler chains, the statistic compares 
the between-chain sample covariance matrix to the within-chain sample covariance matrix 
to measure mixing; good mixing is indicated by empirical convergence to the statistic's 
asymptotic value of unity. 
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First Component Autocorrelations 



Aux. Var. Sampler 
MH Sampler 




(a) Autocorrelations in the first (truncated) component. 
Second Component Autocorrelations 



Aux. Var. Sampler 
MH Sampler 




(b) Autocorrelations in the second component. 

Figure 3: Autocorrelations for the auxiliary variable sampler and MH sampler chains with 
oti = 2, n = 10, j3 — 160. The solid lines show the mean autocorrelation over 50 randomly- 
initialized runs for each sampler, and the dashed lines show the 10th and 90th percentile 
autocorrelation chains over those runs. These plots can be reproduced with the function 
autocorrelation in figures. py. 
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Specifically, loosely following the notation of [1], with ipty for denoting the ith. element of 



the parameter vector in chain j at time t (with i = 1, . 
to compute the n-dimensional MPSRF we form 



j = 1, ...,M, and t = 1, ...,T), 
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The MPSRF itself is then defined when W is full-rank as 
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L, Eq. 4.1 and Lemma 1] 
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where X m&x (A) denotes the eigenvalue of largest modulus of the matrix A and the last 
line follows because conjugating by is a similarity transformation. Equivalently (and 
usefully for computation), we must find the largest solution A to det(XW — V) = 0. 

However, as noted in [1, p. 446], the measure is incalculable when W is singular, and 
because our samples are constrained to lie in the simplex in n dimensions, the matrices 
involved will have rank n — l. Therefore when computing the R statistic, we simply perform 
the natural Euclidean orthogonal projection to the (n — l)-dimcnsional affine subspace on 
which our samples lie; specifically, we define the statistic in terms of Q T VQ and Q T WQ, 
where Q is an n x (n - 



1) matrix such that Q T Q 
1 
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for upper-triangular R of size (n — 1) x (n — 1). 

Figure 4 shows the MPSRF of both samplers computed over 50 sample chains for n = 10 
dimensions, and Figure 5 shows the even greater performance advantage of the auxiliary 
variable sampler in higher dimensions. 



Statistic Convergence Finally, we show the convergence of the component-wise mean 
and variance statistics for the two samplers. We estimated the true statistics by forming 
estimates using samples from 50 independent chains each with 5000 samples, effectively 
using 250000 samples to form the estimates. Next, we plotted the £2 distance between these 
"true" statistic vectors and those estimated at several sample indices along the 50 runs for 
each of the sampling algorithms. See the plots in Figure 6. 
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MH and Aux. Var. Samplers MSPRF vs Sample Indices 
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(a) The horizontal axis is the sample index. 



MH and Aux. Var. Sampler MSPRF vs Computation Time 
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(b) The horizontal axis is elapsed time. 

Figure 4: The R Multivariate Potential Scale Reduction Factor [1] for the auxiliary variable 
sampler and MH sampler with on = 2, n = 10, and /? = 160, with horizontal axes scaled 
by sample index and elapsed time. For each sampler, 5000 samples were drawn for each of 
50 randomly-initialized runs, and the MPSRF was computed at 25 equally-spaced intervals. 
These plots can be reproduced with the function Rhatp in figures. py. 
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MH and Aux. Var. Samplers MSPRF vs Sample Indices 
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(a) The horizontal axis is the sample index. 

MH and Aux. Var. Sampler MSPRF vs Computation Time 

x — x Aux. Var. Sampler 




seconds 



(b) The horizontal axis is elapsed time. 

Figure 5: The R Multivariate Potential Scale Reduction Factor [1] for the auxiliary variable 
sampler and MH sampler with on = 2, n = 20, and (3 — 160, with horizontal axes scaled 
by sample index and elapsed time. For each sampler, 5000 samples were drawn for each of 
50 randomly-initialized runs, and the MPSRF was computed at 25 equally-spaced intervals. 
These plots can be reproduced with the function Rhatp in figures. py. 
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(a) £2 error of component-wise mean esti- 
mate vs sample index. 



Mean Convergence 




(b) £2 error of component-wise variance esti- 
mate vs sample index. 



Variance Convergence 




(c) £2 error of component- wise mean estimate 
vs elapsed time. 



(d) £2 error of component-wise variance esti- 
mate vs elapsed time. 



Figure 6: Component-wise statistic convergence for the auxiliary variable sampler and MH 
sampler with oti — 2, n = 10, and ft — 160, with horizontal axes scaled by sample index and 
elapsed time. For each sampler, 5000 samples were drawn for each of 50 randomly-initialized 
runs. The £2 distances from estimated "true" parameters are plotted, with the solid lines 
corresponding to the mean error and the dashed lines corresponding to 10th and 90th 
percentile errors. These plots can be reproduced with the function statistic_convergence 
in figures .py. 
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